The database comes from one of the biggest industries in Brazil, and one of the largest in the world. Industries and companies around the globe urgently need to understand why employees still suffer injuries and accidents in plants, and sometimes even die in such environments.
The required files are available at the link below.
https://www.kaggle.com/ihmstefanini/industrial-safety-and-health-analytics-database
The database is basically a record of accidents from 12 different plants in 3 different countries, where every line in the data is an occurrence of an accident.
Columns description:
Data: timestamp or time/date information
Countries: which country the accident occurred (anonymised)
Local: the city where the manufacturing plant is located (anonymised)
Industry sector: which sector the plant belongs to
import tensorflow as tf
tf.__version__
# Initialize the random number generator
import random
random.seed(0)
# Ignore the warnings
import warnings
warnings.filterwarnings("ignore")
import glob
# import numpy, pandas and other general libraries
import numpy as np
import pandas as pd
import re
import os
# plot the chart
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# encoding
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
# to split Train and Test data
from sklearn.model_selection import train_test_split
# To pad sentence #
from keras.preprocessing import sequence
from keras.preprocessing.sequence import pad_sequences
# Define the model
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Conv1D, MaxPooling1D
from keras.layers import Activation
from keras.layers import BatchNormalization
from keras.layers import Dropout
from keras.layers import Flatten
from keras.layers import TimeDistributed
from keras.layers import Bidirectional
from keras import regularizers, optimizers
# import optimizer
from tensorflow.keras.optimizers import Adam
# Transformer
from tensorflow.keras import layers
from tensorflow import keras
# for backup
import copy
from google.colab import drive
drive.mount('/content/drive/')
Mount the Google Drive and map it into Colab
#### Set the working directory path where the dataset is stored in Google drive ####
project_path = '/content/drive/My Drive/AIML/Data/capstone/'
Define the project path by specifying the location of the dataset in the google drive
safety_data = pd.read_csv(project_path + "industrial_safety_and_health_database_with_accidents_description.csv")
Read the csv file using pandas read_csv and store the data in safety_data dataframe
print(safety_data.shape)
Safety dataset contains 425 records and 11 attributes
safety_data.head()
Display the first 5 records from the loaded file. It has 11 columns: Date (timestamp), Countries, Local, Industry Sector, Accident Level, Potential Accident Level, Gender, Employee or Third Party details, Critical Risk and Description of the accident
safety_data.columns
Some of the column names are unclear. Hence, correct the column names and add names where they are missing
safety_data.columns =['index', 'Date', 'Country', 'Local', 'Industry Sector', 'Accident Level', 'Potential Accident Level', 'Gender', 'Employee Type', 'Critical Risk', 'Description']
safety_data.columns
safety_data.head()
#safety_data = safety_data.drop(columns=['index', 'Date'], axis=1)
safety_data = safety_data.drop(columns=['index'], axis=1)
safety_data.columns
safety_data.head()
Column 'Index' dropped from the dataset.
# Check if duplicate data exits
duplicates = safety_data.duplicated()
print('Number of duplicate rows = %d' % (duplicates.sum()))
safety_data = safety_data.drop_duplicates()
print(safety_data.shape)
7 duplicate records were found and removed. The dataset now contains 418 unique rows and 10 features
The data needs to be pre-processed before it can be used for training
safety_data["Description_length"]= safety_data["Description"].str.len()
Added a column called "Description_length" that stores the length of the Description text. This helps us decide the minimum and maximum word counts for text processing
safety_data['Description_length'].describe()
The average length of an accident or incident description is 365 characters. The minimum length is 94 characters and the maximum is 1029 characters
safety_data.head()
maximum = max(safety_data["Description"].str.split().apply(len))
maximum
len(safety_data['Description'][183].split())
There is a maximum of 183 words in the "Description" column, and 25 records have this maximum of 183 words.
safety_data['Date'] = pd.to_datetime(safety_data['Date'])
print("Latest Incident Date as per dataset : ", safety_data.Date.max())
print("Oldest Incident Date as per dataset : ", safety_data.Date.min())
The incident records are from 1st Jan 2016 to 9th July 2017
safety_data['Year'] = safety_data['Date'].apply(lambda d : d.year)
safety_data['Month'] = safety_data['Date'].apply(lambda d : d.month)
safety_data['Weekday'] = safety_data['Date'].apply(lambda d : d.day_name())
safety_data['WeekofYear'] = safety_data['Date'].apply(lambda d : d.weekofyear)
safety_data.columns
safety_data.head()
safety_data.info()
Date attributes extracted from 'Date', namely 'Year', 'Month', 'Weekday' and 'WeekofYear', are added as columns in the safety dataset
Seasons are not used here because the accidents span 3 countries, where the seasons may differ; calendar quarters are used instead
safety_data.loc[(safety_data.Month == 4) | (safety_data.Month == 5) | (safety_data.Month == 6), 'Quarter'] = 'AMJ_QTR1'
safety_data.loc[(safety_data.Month == 7) | (safety_data.Month == 8) | (safety_data.Month == 9), 'Quarter'] = 'JAS_QTR2'
safety_data.loc[(safety_data.Month == 10)| (safety_data.Month == 11)| (safety_data.Month == 12),'Quarter'] = 'OND_QTR3'
safety_data.loc[(safety_data.Month == 1) | (safety_data.Month == 2) | (safety_data.Month == 3), 'Quarter'] = 'JFM_QTR4'
safety_data.info()
A new feature "Quarter" is added to the dataset. This helps identify the quarter in which most of the accidents occurred
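As a side note, the same quarter labels could be derived more compactly with a month-to-label dictionary and `Series.map` (a sketch only; `month_to_quarter` is an illustrative name, not part of the notebook pipeline):

```python
import pandas as pd

# Month -> fiscal-quarter label, mirroring the .loc assignments above
month_to_quarter = {m: 'AMJ_QTR1' for m in (4, 5, 6)}
month_to_quarter.update({m: 'JAS_QTR2' for m in (7, 8, 9)})
month_to_quarter.update({m: 'OND_QTR3' for m in (10, 11, 12)})
month_to_quarter.update({m: 'JFM_QTR4' for m in (1, 2, 3)})

months = pd.Series([1, 5, 8, 11])
quarters = months.map(month_to_quarter)
```

Applied to the dataset this would read `safety_data['Quarter'] = safety_data['Month'].map(month_to_quarter)`.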
safety_data.head()
Display first 5 rows.
temp_data = pd.DataFrame(safety_data.dtypes)
temp_data['Missing Values'] = safety_data.isnull().sum()
temp_data['Unique Count/Values'] = safety_data.nunique()
temp_data
No null values found in the safety dataset.
The table above also shows the unique value count for each attribute.
safety_data.info()
Except 'Description_length', all other features are of type object. Based on the type of feature, they need to be converted to categorical variables
safety_data.describe(include=['object'])
Observations:
Need to analyse:
Most of the categorical variables are stored as the object dtype. Hence, they need to be converted to categorical variables
## Change the object/string representation of Accident Level and Potential Accident Level to integers
pot_acc_level = {'I': 1, 'II': 2,'III': 3 , 'IV' : 4, 'V': 5, 'VI' : 6}
safety_data['Accident Level'] = pd.Series([pot_acc_level[x] for x in safety_data['Accident Level']], index=safety_data.index)
safety_data['Potential Accident Level'] = pd.Series([pot_acc_level[x] for x in safety_data['Potential Accident Level']], index=safety_data.index)
safety_data.groupby('Potential Accident Level').size()
The distribution of 'Potential Accident Level' is imbalanced. There are two options: drop the rare level as an outlier, or resample the data.
Here, we will remove this single outlier record and apply SMOTE to upsample the minority classes and balance the data.
safety_data.groupby('Accident Level').size()
The distribution of 'Accident Level' is also imbalanced.
safety_data.info()
# Replacing special symbols in 'Description' column
# re stands for Regular Expression
safety_data['Description'] = safety_data['Description'].apply(lambda s : re.sub('[^a-zA-Z0-9]', ' ', s))
safety_data["Description_length_New"]= safety_data["Description"].str.len()
Removed special symbols from the 'Description' column, as they have very little impact on the meaning of the incident descriptions
safety_data.head()
All characters other than lower-case letters, upper-case letters and digits were replaced with spaces, removing the special symbols from the 'Description' attribute
safety_data.loc[7:13, ['Description_length', 'Description_length_New']].describe()
Comparing the old and new description lengths confirms the special characters have been removed from the description attribute
safety_data = safety_data.drop(columns=['Description_length', 'Description_length_New'], axis=1)
# copy and keep the original dataset - Analysis Base Table (ABT)
safety_bkup_df = copy.deepcopy(safety_data)
safety_df = safety_data.drop(columns=['Description'], axis=1)
plt.figure(figsize=(30, 20))  # Set the figure size
pos = 1  # a variable to manage the position of the subplot in the overall plot
for feature in safety_df.columns:  # iterate over every attribute whose distribution is to be visualized
    plt.subplot(5, 3, pos)  # plot grid
    # Plot a bar chart for each categorical feature
    sns.countplot(safety_df[feature])
    pos += 1  # to plot over the grid one by one
Observations:
safety_data['Critical Risk'].unique()
safety_data.groupby(['Critical Risk'], sort=True).size()
risk_type = safety_data.groupby('Critical Risk').count().sort_values(by=['Date'], ascending = True).reset_index()
plt.figure(figsize=(20,10))
plt.barh(risk_type['Critical Risk'],risk_type['Date'])
plt.xticks(rotation = 'horizontal')
import plotly.express as px
fig = px.pie(safety_data, names='Country', template='seaborn')
fig.update_traces(rotation=45, pull=[0.1,0.01,0.1,0.01,0.1], textinfo="percent+label", showlegend=True)
fig.show()
The chart shows Country_01 is the most impacted, followed by Country_02 and Country_03.
Approximately 60% of accidents happen in Country_01. We need to understand what type of accidents happen in Country_01
fig = px.pie(safety_data, names='Gender', template='seaborn')
fig.update_traces(rotation=45, pull=[0.1,0.01,0.1,0.01,0.1],textinfo="percent+label", showlegend=True)
fig.show()
It shows 95% of plant accidents or injuries happened to Male employees
fig = px.pie(safety_data, names='Accident Level', template='seaborn')
fig.update_traces(rotation=45, pull=[0.1,0.01,0.1,0.01,0.1],textinfo="percent+label", showlegend=True)
fig.show()
Accident Level I accounts for 73% of occurrences
fig = px.pie(safety_data, names='Potential Accident Level', template='seaborn')
fig.update_traces(rotation=45, pull=[0.1,0.01,0.1,0.01,0.1],textinfo="percent+label", showlegend=True)
fig.show()
However, the potential accident level shows Type IV, a more severe level, at 33%: roughly one third of the accidents had the potential to reach accident level IV
safety_data.groupby(['Country']).size()
Country_01 has approximately 60% of the incidents
safety_data.groupby(['Country', 'Accident Level']).size()
42% of overall incidents were Accident Level I and happened in Country_01
def chart_Vs_AccLevel(data, feature):
    fig = plt.figure(figsize=(20, 10))
    ax = fig.add_subplot(121)
    sns.countplot(x=feature, data=data, ax=ax, orient='v',
                  hue='Accident Level').set_title(feature.capitalize() + ' count plot by Accident Level', fontsize=13)
    plt.legend(labels=data['Accident Level'].unique())
    plt.xticks(rotation=90)
    ax = fig.add_subplot(122)
    sns.countplot(x=feature, data=data, ax=ax, orient='v',
                  hue='Potential Accident Level').set_title(feature.capitalize() + ' count plot by Potential Accident Level',
                                                            fontsize=13)
    plt.legend(labels=data['Potential Accident Level'].unique())
    plt.xticks(rotation=90)
    return plt.show()
chart_Vs_AccLevel(safety_data, 'Country')
Accident Level:
Potential Accident Level:
chart_Vs_AccLevel(safety_data, 'Local')
#safety_data.groupby(['Country', 'Local', 'Accident Level']).size()
safety_data.groupby(['Country', 'Local']).size()
Location_03, which is in Country_01, has experienced the most accidents, followed by Location_05 in Country_02
chart_Vs_AccLevel(safety_data,'Industry Sector')
#safety_data.groupby(['Country', 'Accident Level', 'Local', 'Industry Sector']).size()
safety_data.groupby(['Country', 'Local', 'Industry Sector']).size()
In Country_01, the data clearly shows the "Mining" industry sector is the most accident-prone area, so safety preventive actions should be prioritised there. In Country_02, the "Metals" industry sector has experienced the most accidents
safety_data.groupby(['Country', 'Critical Risk']).size()
In Country_01 and Country_02, the majority of the critical risks are recorded as "Others". In Country_03, specific risks like 'Bees' and 'Venomous Animals' are given, but the 'Industry Sector' is unclear as it is recorded as "Others"
safety_data.groupby(['Country', 'Gender','Accident Level']).size()
Male employees were involved in 41% of overall incidents/accidents in the Accident Level I category occurring in Country_01
safety_data.groupby(['Country', 'Local', 'Industry Sector','Employee Type', 'Gender']).size()
chart_Vs_AccLevel(safety_data, 'Gender')
safety_data.groupby(['Country', 'Local', 'Industry Sector','Employee Type']).size()
chart_Vs_AccLevel(safety_data,'Employee Type')
sns.factorplot(x='Year', y='Potential Accident Level', data=safety_data, hue='Industry Sector', aspect=2, size=4)
The factor plot shows potential accident levels remain roughly the same across both years; only Mining shows a slight increase
sns.factorplot(x='Year', y='Accident Level', data=safety_data, hue='Industry Sector', aspect=2, size=4)
The factor plot shows the accident level reduced significantly in the other sectors, but there is a slight increase in 2017 compared to 2016
piv_accident_level =safety_data.pivot_table(index='Month', columns=[ 'Year','Accident Level'], aggfunc='count')['Country']
piv_accident_level
piv_potential_level =safety_data.pivot_table(index='Month', columns=[ 'Year','Potential Accident Level'], aggfunc='count')['Country']
piv_potential_level
fig = plt.figure(figsize=(20,7))
ax = fig.add_subplot(2, 2, 1)
piv_accident_level[2016].plot(kind='bar', ax=ax, width=0.9, cmap='cool', title='2016 Accident Levels')
plt.legend(bbox_to_anchor=(0.9, 1), loc=1, borderaxespad=0.)
ax = fig.add_subplot(2, 2, 2)
piv_accident_level[2017].plot(kind='bar', ax=ax, width=0.9, cmap='cool', title='2017 Accident Levels')
plt.legend(bbox_to_anchor=(0.9, 1), loc=1, borderaxespad=0.)
ax = fig.add_subplot(2, 2, 3)
piv_potential_level[2016].plot(kind='bar', ax=ax, width=0.9, cmap='cool', title='2016 Potential Accident Levels')
plt.legend(bbox_to_anchor=(0.9, 1), loc=1, borderaxespad=0.)
ax = fig.add_subplot(2, 2, 4)
piv_potential_level[2017].plot(kind='bar', ax=ax, width=0.9, cmap='cool', title='2017 Potential Accident Levels')
plt.legend(bbox_to_anchor=(0.9, 1), loc=1, borderaxespad=0.)
The bar charts above show the month-wise distribution of accident levels and potential accident levels. For the year 2017, we only have data for 7 months.
They clearly show that the realised accident level is lower than the potential accident level.
Most accidents are Level I and occurred throughout the year, even though the potential accident level is high.
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
ac_level_cnt = np.round(safety_data['Accident Level'].value_counts(normalize=True) * 100)
pot_ac_level_cnt = np.round(safety_data['Potential Accident Level'].value_counts(normalize=True) * 100, decimals=1)
ac_pot = pd.concat([ac_level_cnt, pot_ac_level_cnt], axis=1,sort=False).fillna(0).rename(columns={'Accident Level':'Accident', 'Potential Accident Level':'Potential'})
ac_pot = pd.melt(ac_pot.reset_index(), ['index']).rename(columns={'index':'Severity', 'variable':'Levels'})
hv.Bars(ac_pot, ['Severity', 'Levels'], 'value').opts(opts.Bars(title="Accident Levels Count", width=700, height=300,tools=['hover'],\
show_grid=True,xrotation=45, ylabel="Percentage", yformatter='%d%%'))
This bar chart shows the comparison of Accident Level Vs Potential Accident Level
In Country_01, approximately 80% of accidents occurred in the Mining industry sector. The Mining industry operates in Location_01, Location_03 and Location_04.
There could be several underlying reasons, but with the limited data it is difficult to identify them
Yes, the data shows the realised accident level is lower than the potential accident levels; the Type IV accidents in Country_01 (Mining) could be prevented
Industry Sector in this region is Mining
safety_data.groupby(['Country', 'Local', 'Industry Sector','Quarter']).size()
The data shows significantly more accidents occurred during the JFM quarter. One possible reason is that more work was done during that quarter.
safety_data.groupby(['Weekday','Country', 'Industry Sector']).size()
Based on the data, the accident count in the Mining sector is highest on Friday and Saturday, followed by Thursday.
As the data covers only a short period (about a year and a half), we cannot assume that the weekday or quarter has any direct impact on the accident level or potential accident level.
safety_data.info()
#Take a copy and apply encoding
safety_data_encod = copy.deepcopy(safety_data)
safety_data_encod['Country'] = LabelEncoder().fit_transform(safety_data_encod['Country'])
safety_data_encod['Local'] = LabelEncoder().fit_transform(safety_data_encod['Local'])
safety_data_encod['Industry Sector'] = LabelEncoder().fit_transform(safety_data_encod['Industry Sector'])
safety_data_encod['Gender'] = LabelEncoder().fit_transform(safety_data_encod['Gender'])
safety_data_encod['Employee Type'] = LabelEncoder().fit_transform(safety_data_encod['Employee Type'])
safety_data_encod['Critical Risk'] = LabelEncoder().fit_transform(safety_data_encod['Critical Risk'])
safety_data_encod['Quarter'] = LabelEncoder().fit_transform(safety_data_encod['Quarter'])
safety_data_encod['Weekday'].unique()
## Change the object/string representation of Weekday to integers
weekday_t = {'Monday': 1, 'Tuesday': 2,'Wednesday': 3 , 'Thursday' : 4, 'Friday': 5, 'Saturday' : 6, 'Sunday' : 7}
safety_data_encod['Weekday'] = pd.Series([weekday_t[x] for x in safety_data_encod['Weekday']], index=safety_data_encod.index)
safety_data_encod['Weekday'].unique()
safety_data_encod.info()
#safety_data_encod = safety_data_encod.drop(columns=['Description_length','Month','Year','WeekofYear'], axis=1)
safety_data_encod = safety_data_encod.drop(columns=['Month','Year','WeekofYear'], axis=1)
from sklearn.feature_selection import f_classif, chi2, mutual_info_classif
from statsmodels.stats.multicomp import pairwise_tukeyhsd
X = safety_data_encod.drop(columns=['Accident Level', 'Potential Accident Level','Date', 'Description'], axis=1)
y = safety_data_encod['Potential Accident Level']
To select the features that have the strongest relationship with the output variable, apply the chi-squared (chi²) statistical test, which works on non-negative features.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
#apply SelectKBest class to extract top best features
bestfeatures = SelectKBest(score_func=chi2, k=7)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score'] #naming the dataframe columns
print(featureScores.nlargest(7,'Score')) #print the 7 best features
# Correlation matrix
corr1 = safety_data_encod.corr()
mask1=np.zeros_like(corr1);
mask1[np.triu_indices_from(mask1, 1)] = True
plt.figure(figsize=(16,10))
sns.heatmap(corr1,annot=True, fmt = '.2f', mask=mask1)
Here are the observations:
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X,y)
print(model.feature_importances_) #use inbuilt class feature_importances of tree based classifiers
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
Irrelevant or partially relevant features can negatively impact model performance, so feature selection and data cleaning should be the first and most important steps of model design. Feature selection is the process of automatically or manually selecting the features that contribute most to the prediction variable
Here, country has high feature importance followed by Industry Sector, Gender, Local, Critical Risk, Weekday, Quarter and Employee Type
safety_data_encod.info()
#Take a copy for creating featured data for ML
safety_data_ml = copy.deepcopy(safety_data)
safety_data_nlp = copy.deepcopy(safety_data)
safety_data_ml['Country'] = LabelEncoder().fit_transform(safety_data_ml['Country'])
safety_data_ml['Local'] = LabelEncoder().fit_transform(safety_data_ml['Local'])
safety_data_ml['Critical Risk'] = LabelEncoder().fit_transform(safety_data_ml['Critical Risk'])
## Change the object/string representations of Weekday, Employee Type, Industry Sector, Gender and Quarter to integers
weekday_t = {'Monday': 1, 'Tuesday': 2,'Wednesday': 3 , 'Thursday' : 4, 'Friday': 5, 'Saturday' : 6, 'Sunday' : 7}
emp_type_e = {'Employee' : 1, 'Third Party' : 2, 'Third Party (Remote)' :3}
ind_sector_e = {'Mining' :1 , 'Metals' : 2 , 'Others': 3 }
gender_e = { 'Male' : 1, 'Female':2}
quarter_e = {'AMJ_QTR1': 1, 'JAS_QTR2' : 2, 'OND_QTR3' : 3, 'JFM_QTR4': 4}
safety_data_ml['Weekday'] = pd.Series([weekday_t[x] for x in safety_data_ml['Weekday']], index=safety_data_ml.index)
safety_data_ml['Employee Type'] = pd.Series([emp_type_e[x] for x in safety_data_ml['Employee Type']], index=safety_data_ml.index)
safety_data_ml['Industry Sector'] = pd.Series([ind_sector_e[x] for x in safety_data_ml['Industry Sector']], index=safety_data_ml.index)
safety_data_ml['Gender'] = pd.Series([gender_e[x] for x in safety_data_ml['Gender']], index=safety_data_ml.index)
safety_data_ml['Quarter'] = pd.Series([quarter_e[x] for x in safety_data_ml['Quarter']], index=safety_data_ml.index)
Drop the 'Date' column and the date-derived 'Month', 'Year' and 'WeekofYear' columns
safety_data_ml = safety_data_ml.drop(columns=['Date','Month','Year','WeekofYear'], axis=1)
column_names =['Country', 'Industry Sector', 'Gender' , 'Local' , 'Critical Risk', 'Weekday', 'Quarter', 'Employee Type', 'Description', 'Description_length', 'Accident Level', 'Potential Accident Level']
safety_data_ml = safety_data_ml.reindex(columns=column_names)
safety_data_ml.info()
safety_data_nlp.info()
safety_data_nlp = safety_data_nlp.drop(columns=['Date','Month','Year','WeekofYear'], axis=1)
safety_data_nlp = safety_data_nlp.drop(columns=['Weekday', 'Quarter'], axis=1)
safety_data_nlp.info()
safety_data_nlp = safety_data_nlp.reindex(columns=column_names)
filename_ml = 'safety_data_ml.csv'
filename_nlp = 'safety_data_nlp.csv'
safety_data_ml.to_csv(project_path+filename_ml, index = False)
safety_data_nlp.to_csv(project_path+filename_nlp, index = False)
List of NLP Pre-Processing
import re
import nltk
import spacy
import string
from collections import Counter
from nltk.stem import WordNetLemmatizer
pd.options.mode.chained_assignment = None
import nltk
nltk.download('stopwords')
nltk.download('brown')
nltk.download('names')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.util import ngrams
from wordcloud import WordCloud, STOPWORDS
!pip install normalise
from normalise import normalise
import en_core_web_sm
nlp = en_core_web_sm.load()
class TextProcessor():
    def __init__(self, text_df):
        self.lemmatizer = WordNetLemmatizer()
        cnt = Counter()
        for text in text_df["Description"].values:
            for word in text.split():
                if word.lower() not in stopwords.words('english'):
                    cnt[word] += 1
        n_words = 10
        self.most_frequent_words = set([w for (w, wc) in cnt.most_common(n_words)])
        print(f'Top {n_words} frequent words : {self.most_frequent_words}')
        self.most_infrequent_words = set([w for (w, wc) in cnt.most_common()[:-n_words-1:-1]])
        print(f'Top {n_words} rare words : {self.most_infrequent_words}')

    def remove_punctuation(self, text):
        return text.translate(str.maketrans(' ', ' ', string.punctuation))

    #def remove_names(self, text):
    #    orig_words_list = text.split()
    #    tagged_sentence = nltk.tag.pos_tag(orig_words_list)
    #    word_list = [word for word, tag in tagged_sentence if tag != 'NNP' and tag != 'NNPS']
    #    print(f'Removed proper noun(s) : {set(orig_words_list) - set(word_list)}')
    #    return ' '.join(word for word in word_list)

    def remove_words(self, text, removable_words):
        return " ".join([word for word in text.split() if word not in removable_words])

    def preprocess(self, text_df):
        # remove names - like Anthony, cristóbal, eduardo eric fernández
        # TODO - check whether this is removing too many words, especially the ones starting with a capital letter
        #text_df["Description"] = text_df["Description"].apply(lambda text: self.remove_names(text))
        print("Converting to lower case")
        text_df["Description"] = text_df["Description"].str.lower()
        print("Removing standard punctuations")
        text_df["Description"] = text_df["Description"].apply(lambda text: self.remove_punctuation(text))
        print("Removing Stopwords")
        EXCLUDED_REMOVE_WORDS = {'hand'}
        rem_words_set = {"x", "cm", "kg", "mr", "nv", "da", "pm", "am", "cx"}
        new_words_remove = {"cause", "employee", "activity", "right", "leave",
                            "worker", "operator", "collaborator",
                            "one", "two", "second", "third",
                            "generate", "time", "perform", "moment",
                            "assistant", "approximate", "describe", "mechanic", "company", "work", "support"}
        # remove frequent words that do not contribute to the model
        # words_to_remove = rem_words_set.union(set(stopwords.words('english'))).union(self.most_frequent_words).union(self.most_infrequent_words).difference(EXCLUDED_REMOVE_WORDS)
        words_to_remove = rem_words_set.union(set(stopwords.words('english')))
        print(f"Removing {words_to_remove}")
        text_df["Description"] = text_df["Description"].apply(lambda text: self.remove_words(text, words_to_remove))
        print("Lemmatizing")
        text_df["Description"] = text_df["Description"].apply(lambda text: ' '.join([t.lemma_ for t in nlp(text)]))
        print("Removing words containing numbers - like cx695, 945")
        text_df["Description"] = text_df["Description"].apply(lambda text: ' '.join(s for s in text.split() if not any(c.isdigit() for c in s)))
        print(f"Removing {new_words_remove}")
        text_df["Description"] = text_df["Description"].apply(lambda text: self.remove_words(text, new_words_remove))
        return text_df
text_processed = TextProcessor(safety_data_nlp)
safety_data_nlp_new = text_processed.preprocess(safety_data_nlp.copy())
safety_data_nlp_new['Description'].head()
safety_data_nlp_new.head()
def ngram_func(ngram, trg='', trg_value=''):
    # trg_value is a list object
    if (trg == '') or (trg_value == ''):
        string_filterd = safety_data_nlp_new['Description'].sum().split()
    else:
        string_filterd = safety_data_nlp_new[safety_data_nlp_new[trg].isin(trg_value)]['Description'].sum().split()
    dic = nltk.FreqDist(nltk.ngrams(string_filterd, ngram)).most_common(50)
    ngram_df = pd.DataFrame(dic, columns=['ngram', 'count'])
    ngram_df.index = [' '.join(i) for i in ngram_df.ngram]
    ngram_df.drop('ngram', axis=1, inplace=True)
    return ngram_df
from bokeh.io import output_notebook
output_notebook()
hv.extension('bokeh')
hv.Bars(ngram_func(1)[::-1]).opts(title="Industry Safety : Description -> Unigram Count Top-50 ", color="orange", xlabel="Unigrams", ylabel="Count")\
.opts(opts.Bars(width=700, height=700,tools=['hover'],show_grid=True,invert_axes=True))
hv.extension('bokeh')
hv.Bars(ngram_func(2)[::-1]).opts(title="Industry Safety : Description -> Bigram Count Top-50", color="green", xlabel="Bigrams", ylabel="Count")\
.opts(opts.Bars(width=700, height=700,tools=['hover'],show_grid=True,invert_axes=True))
hv.extension('bokeh')
hv.Bars(ngram_func(3)[::-1]).opts(title="Industry Safety : Description -> Trigram Count Top-50 ", color="pink", xlabel="Trigrams", ylabel="Count")\
.opts(opts.Bars(width=700, height=700,tools=['hover'],show_grid=True,invert_axes=True))
unigram_df = (ngram_func(1)[::-1])
bigram_df = (ngram_func(2)[::-1])
trigram_df = (ngram_func(3)[::-1])
unigram_df.sort_values(by='count', ascending=False)
#df.sort_values(by='col1', ascending=False)
bigram_df.sort_values(by='count', ascending=False)
trigram_df.sort_values(by='count', ascending=False)
wordcloud = WordCloud(width = 1500, height = 800, random_state=0, background_color='black', colormap='rainbow',\
min_font_size=5, max_words=300, collocations=False, stopwords = STOPWORDS).generate(" ".join(safety_data_nlp['Description'].values))
plt.figure(figsize=(15,10))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
STOPWORDS.update(["x", "cm", "kg", "mr", "nv", "da", "pm", "am", "cx",
"cause", "employee", "activity","right", "leave",
"worker", "operator", "collaborator",
"one", "two", "second", "third",
"generate", "right", "time", "perform", "moment",
"assistant", "approximate", "describe", "mechanic", "company", "work", "support"])
print(STOPWORDS)
wordcloud = WordCloud(width = 1500, height = 800, random_state=0, background_color='black', colormap='rainbow',\
min_font_size=5, max_words=300, collocations=False, stopwords = STOPWORDS).generate(" ".join(safety_data_nlp_new['Description'].values))
plt.figure(figsize=(15,10))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
safety_data_nlp_new.shape
maximum = max(safety_data_nlp_new["Description"].str.split().apply(len))
maximum
len(safety_data_nlp_new['Description'][88].split())
# copy and keep the original dataset
safety_data_nlp_bkup = copy.deepcopy(safety_data_nlp_new)
feature_df1 = copy.deepcopy(safety_data_nlp_new)
feature_df1 = feature_df1[feature_df1['Potential Accident Level'] != 6]
feature_df1.shape
There is only one record with potential accident level 6 (VI), so that record is removed before processing.
pot_acc_level = {1: 'POTACTA', 2: 'POTACTB', 3: 'POTACTC', 4 : 'POTACTD', 5: 'POTACTE'}
feature_df1['Potential Accident Level'] = pd.Series([pot_acc_level[x] for x in feature_df1['Potential Accident Level']], index=feature_df1.index)
feature_df1.groupby('Potential Accident Level').size()
#X_feat = safety_data_nlp_new.drop(columns=['Accident Level', 'Potential Accident Level'], axis=1)
#X_feat = feature_df1.drop(columns=['Accident Level', 'Potential Accident Level'], axis=1)
#y_label = feature_df1['Potential Accident Level']
X_feat = feature_df1.drop(columns=['Accident Level'], axis=1)
y_label = feature_df1['Accident Level']
X_feat.shape, y_label.shape
#X_concat = X_feat['Country'].map(str) + ' ' + X_feat['Industry Sector'].map(str) + ' ' + X_feat['Gender'].map(str) + ' ' + X_feat['Local'].map(str) + ' ' + X_feat['Critical Risk'].map(str) + ' ' + X_feat['Weekday'].map(str) + ' ' + X_feat['Quarter'].map(str) + ' ' + X_feat['Employee Type'].map(str) + ' ' + X_feat['Description'].map(str)
X_concat = X_feat['Country'].map(str) + ' ' + X_feat['Industry Sector'].map(str) + ' ' + X_feat['Gender'].map(str) + ' ' + X_feat['Local'].map(str) + ' ' + X_feat['Critical Risk'].map(str) + ' ' + X_feat['Employee Type'].map(str) + ' ' + X_feat['Description'].map(str) + ' ' + X_feat['Potential Accident Level'].map(str)
X_concat[0]
maximum = max(X_concat.str.split().apply(len))
maximum
X_concat[2]
X_concat.shape
X_concat.head()
maximum1 = max(X_concat.str.split().apply(len))
maximum1
#### Split the data into train and test sets 80:20, as the number of input records is small
# 7 is just an arbitrary random seed; the target y is imbalanced, so stratify=y could be passed
#X_train, X_test, y_train, y_test = train_test_split(X_concat, y, test_size=0.1, random_state=7, stratify=y)
X_train, X_test, y_train, y_test = train_test_split(X_concat, y_label, test_size=0.2, random_state=7)
print('X_concat = ', X_concat.shape, ', X_train = ', X_train.shape, ', X_test = ', X_test.shape )
print('y_label = ', y_label.shape, ', y_train = ', y_train.shape, ', y_test = ', y_test.shape )
X_train.head()
y_label[0]
vocab_size = 20000 # Only consider the top 20k words
maxlen = 100 # Only consider the first 100 words of each accident description
# Initialize the tokenizer with num_words = 20,000 (effectively keeps the 19,999 most common words)
tokenizer = Tokenizer(num_words=vocab_size)
# Fit the tokenizer on X_train, which contains the concatenated text features
tokenizer.fit_on_texts(X_train)
# convert text to sequences - sequence encoding for the train and test text features
train_encoding = tokenizer.texts_to_sequences(X_train)
test_encoding = tokenizer.texts_to_sequences(X_test)
len(train_encoding[0])
num_words = len(tokenizer.word_index) + 1
print(num_words)
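As a minimal, hypothetical sketch of what `Tokenizer` does here (toy texts, not the real dataset): `fit_on_texts` builds a frequency-ranked word index, with index 0 reserved for padding, and `texts_to_sequences` maps each text to integer ids.

```python
from collections import Counter

# Toy texts standing in for the real descriptions
texts = ["mill worker hand injury", "worker fall injury"]
counts = Counter(w for t in texts for w in t.split())
# Frequency-ranked word index; index 0 is reserved for padding
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}
seqs = [[word_index[w] for w in t.split()] for t in texts]
print(word_index)
print(seqs)
```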
print("Pad each sequence to the maximum length = ", maxlen)
X_train = pad_sequences(train_encoding, maxlen=maxlen, padding='post')
X_test = pad_sequences(test_encoding, maxlen=maxlen, padding='post')
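`pad_sequences(..., padding='post')` zero-pads at the end and, with the default `truncating='pre'`, drops tokens from the front of over-long sequences; a minimal pure-Python sketch of that behaviour:

```python
def pad_post(seqs, maxlen):
    """Mimic pad_sequences(..., padding='post') with the default truncating='pre'."""
    out = []
    for s in seqs:
        s = list(s)[-maxlen:]                    # keep the last maxlen tokens
        out.append(s + [0] * (maxlen - len(s)))  # zero-pad at the end
    return out

print(pad_post([[5, 3], [1, 2, 3, 4, 5, 6]], 4))  # → [[5, 3, 0, 0], [3, 4, 5, 6]]
```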
X_train[1]
#### Shape of the text features #####
print("Overall features shape = ", X_concat.shape)
print("X_train shape = ", X_train.shape)
print("X_test shape = ", X_test.shape)
#### shape of label records #####
print("Overall labels shape = ", y_label.shape)
print("y_train shape = ", y_train.shape)
print("y_test shape = ", y_test.shape)
unique, counts = np.unique(y_train, return_counts=True)
dict(zip(unique, counts))
from imblearn.over_sampling import SMOTE
oversample = SMOTE()
X_train_oversample, y_train_oversample = oversample.fit_resample(X_train, y_train)
unique, counts = np.unique(y_train_oversample, return_counts=True)
dict(zip(unique, counts))
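SMOTE synthesizes new minority-class rows by interpolating between a sample and one of its same-class nearest neighbours; a minimal numpy sketch of that core step (toy 2-d points, not the padded sequences above):

```python
import numpy as np

rng = np.random.default_rng(0)
# One synthetic minority sample: interpolate between a point and a
# same-class neighbour, as SMOTE does for each oversampled example
x = np.array([1.0, 2.0])
neighbour = np.array([3.0, 4.0])
gap = rng.random()                      # uniform in [0, 1)
synthetic = x + gap * (neighbour - x)   # lies on the segment between the two
print(synthetic)
```

Note that interpolating padded token ids, as done above, produces fractional "word indices", which is a known limitation of oversampling integer-encoded text.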
from tensorflow.keras.utils import to_categorical
y_train = to_categorical(np.asarray(y_train_oversample))
y_test_1 = to_categorical(np.asarray(y_test))
print('y_label = ', y_label.shape,',y_train = ',y_train.shape, ', y_test =',y_test_1.shape)
y_train
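`to_categorical` one-hot encodes the integer labels into rows like those shown above; a minimal numpy equivalent:

```python
import numpy as np

def one_hot(y, num_classes):
    out = np.zeros((len(y), num_classes), dtype=int)
    out[np.arange(len(y)), y] = 1   # set the column matching each label
    return out

print(one_hot(np.array([0, 2, 1]), 3))
```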
X_train = X_train_oversample
y_test = y_test_1
print('X = ', X_concat.shape,',X_train = ',X_train.shape, ', X_test =',X_test.shape)
from tensorflow import keras
from tensorflow.keras import layers

class TransformerBlock(layers.Layer):
def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
super(TransformerBlock, self).__init__()
self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
self.ffn = keras.Sequential(
[layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
)
self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
self.dropout1 = layers.Dropout(rate)
self.dropout2 = layers.Dropout(rate)
def call(self, inputs, training):
attn_output = self.att(inputs, inputs)
attn_output = self.dropout1(attn_output, training=training)
out1 = self.layernorm1(inputs + attn_output)
ffn_output = self.ffn(out1)
ffn_output = self.dropout2(ffn_output, training=training)
return self.layernorm2(out1 + ffn_output)
class TokenAndPositionEmbedding(layers.Layer):
def __init__(self, maxlen, vocab_size, embed_dim):
super(TokenAndPositionEmbedding, self).__init__()
self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)
def call(self, x):
maxlen = tf.shape(x)[-1]
positions = tf.range(start=0, limit=maxlen, delta=1)
positions = self.pos_emb(positions)
x = self.token_emb(x)
return x + positions
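A small numpy sketch of what `TokenAndPositionEmbedding` computes: look up a vector per token id, look up a vector per position, and add them elementwise (toy table sizes, random weights):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len, d = 10, 4, 3                 # toy sizes
tok_table = rng.normal(size=(vocab, d))      # stands in for the token Embedding
pos_table = rng.normal(size=(seq_len, d))    # stands in for the position Embedding
x = np.array([2, 5, 7, 0])                   # one sequence of token ids
emb = tok_table[x] + pos_table[np.arange(len(x))]  # token vector + position vector
print(emb.shape)  # (4, 3)
```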
EMBEDDING_FILE = project_path + '/glove.6B.300d.txt'
embeddings = {}
for o in open(EMBEDDING_FILE):
word = o.split(" ")[0]
# print(word)
embd = o.split(" ")[1:]
embd = np.asarray(embd, dtype='float32')
# print(embd)
embeddings[word] = embd
# create a weight matrix for words in training docs
embedding_matrix = np.zeros((num_words, 300))
for word, i in tokenizer.word_index.items():
embedding_vector = embeddings.get(word)
if embedding_vector is not None:
embedding_matrix[i] = embedding_vector
Loaded the pre-trained GloVe embedding weights and stored them in embedding_matrix for further processing.
The embedding matrix holds one pre-trained 300-dimensional GloVe vector per word in the vocabulary.
embedding_matrix.shape
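A toy illustration of the weight-matrix construction above, with a hypothetical 2-dimensional vocabulary in place of the 300-d GloVe file:

```python
import numpy as np

# Hypothetical toy vocabulary mirroring the GloVe loop above
embeddings_demo = {"fall": np.array([0.1, 0.2]), "burn": np.array([0.3, 0.4])}
word_index_demo = {"fall": 1, "burn": 2, "scaffold": 3}  # index 0 reserved for padding
matrix = np.zeros((len(word_index_demo) + 1, 2))
for word, i in word_index_demo.items():
    vec = embeddings_demo.get(word)
    if vec is not None:        # words missing from the embeddings keep an all-zero row
        matrix[i] = vec
print(matrix)
```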
import time
#### Define the call back ####
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
stop = EarlyStopping(monitor="val_loss", patience=5)
#reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5, min_lr=1e-5, verbose=1)
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5, min_lr=1e-6, verbose=1)
def nlp_lstm_model1():
learning_rate = 0.00099
model = Sequential()
# Embedding layer
model.add(
Embedding(
input_dim=num_words,
output_dim=300,
weights=[embedding_matrix],
input_length=maxlen,
trainable=False))
# Recurrent layer
model.add(
Bidirectional(
LSTM(
300,return_sequences=True)))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(100))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(50))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(6, activation='softmax'))
adam = optimizers.Adam(learning_rate=learning_rate, decay=1e-6)
# Compile model
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
return model
lstm_model1 = nlp_lstm_model1()
lstm_model1.summary()
#### Train the model ####
start = time.perf_counter()
batch_size = 8
lstm_history = lstm_model1.fit(X_train, y_train, epochs=100, batch_size=batch_size, validation_data=(X_test, y_test), verbose=2, callbacks=[stop, reduce_lr])
end = time.perf_counter()
print('Time spent:', end-start)
#### Train the model ####
start = time.perf_counter()
batch_size = 16
lstm_history = lstm_model1.fit(X_train, y_train, epochs=100, batch_size=batch_size, validation_data=(X_test, y_test), verbose=2, callbacks=[stop, reduce_lr])
end = time.perf_counter()
print('Time spent:', end-start)
#### calculate the accuracy and print ####
lstm_scores = lstm_model1.evaluate(X_test, y_test, verbose=0)
print("Accuracy of the LSTM model : %.2f%%" % (lstm_scores[1]*100))
def nlp_lstm_model2():
learning_rate = 0.00001
model = Sequential()
# Embedding layer
model.add(
Embedding(
input_dim=num_words,
output_dim=300,
weights=[embedding_matrix],
input_length=maxlen,
trainable=False))
# Recurrent layer
model.add(
Bidirectional(
LSTM(
300,return_sequences=True)))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dropout(0.6))
model.add(Dense(100))
#model.add(BatchNormalization())
#model.add(Activation('relu'))
#model.add(Dropout(0.6))
model.add(Dense(50))
#model.add(BatchNormalization())
#model.add(Activation('relu'))
#model.add(Dropout(0.6))
model.add(Dense(6, activation='softmax'))
adam = optimizers.Adam(learning_rate=learning_rate)  #, decay=1e-6)
# Compile model
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
return model
lstm_model2 = nlp_lstm_model2()
lstm_model2.summary()
#### Train the model ####
start = time.perf_counter()
batch_size = 8
lstm_history = lstm_model2.fit(X_train, y_train, epochs=100, batch_size=batch_size, validation_data=(X_test, y_test), verbose=2, callbacks=[stop, reduce_lr])
end = time.perf_counter()
print('Time spent:', end-start)
#### calculate the accuracy and print ####
lstm_scores = lstm_model2.evaluate(X_test, y_test, verbose=0)
print("Accuracy of the LSTM model : %.2f%%" % (lstm_scores[1]*100))
def nlp_lstm_model3():
#learning_rate = 0.00099
learning_rate = 0.00001
filters = 32
kernal_size = 3
model = Sequential()
# Embedding layer
model.add(
Embedding(
input_dim=num_words,
output_dim=300,
weights=[embedding_matrix],
input_length=maxlen,
trainable=False))
model.add(Dropout(0.2))
model.add(
Conv1D(
filters,
kernal_size,
padding = 'valid',
activation ='relu'))
model.add(
MaxPooling1D())
# Recurrent layer
model.add(
Bidirectional(
LSTM(
300,return_sequences=True)))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(100))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(50))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(6, activation='softmax'))
adam = optimizers.Adam(learning_rate=learning_rate)
#adam = optimizers.Adam(lr=learning_rate, decay=1e-6)
# Compile model
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
return model
lstm_model3 = nlp_lstm_model3()
lstm_model3.summary()
#### Train the model ####
start = time.perf_counter()
batch_size = 4
lstm_history = lstm_model3.fit(X_train, y_train, epochs=100, batch_size=batch_size, validation_data=(X_test, y_test), verbose=2, callbacks=[stop, reduce_lr])
end = time.perf_counter()
print('Time spent:', end-start)
#### Train the model ####
start = time.perf_counter()
batch_size = 8
lstm_history = lstm_model3.fit(X_train, y_train, epochs=100, batch_size=batch_size, validation_data=(X_test, y_test), verbose=2, callbacks=[stop, reduce_lr])
end = time.perf_counter()
print('Time spent:', end-start)
#### Train the model ####
start = time.perf_counter()
batch_size = 16
lstm_history = lstm_model3.fit(X_train, y_train, epochs=100, batch_size=batch_size, validation_data=(X_test, y_test), verbose=2, callbacks=[stop, reduce_lr])
end = time.perf_counter()
print('Time spent:', end-start)
#### Train the model ####
start = time.perf_counter()
batch_size = 32
lstm_history = lstm_model3.fit(X_train, y_train, epochs=100, batch_size=batch_size, validation_data=(X_test, y_test), verbose=2, callbacks=[stop, reduce_lr])
end = time.perf_counter()
print('Time spent:', end-start)
#### calculate the accuracy and print ####
lstm_scores = lstm_model3.evaluate(X_test, y_test, verbose=0)
print("Accuracy of the LSTM model : %.2f%%" % (lstm_scores[1]*100))
def nlp_lstm_model4():
#learning_rate = 0.00099
learning_rate = 0.00001
filters = 256
kernal_size = 5
model = Sequential()
# Embedding layer
model.add(
Embedding(
input_dim=num_words,
output_dim=300,
weights=[embedding_matrix],
input_length=maxlen,
trainable=False))
#model.add(Dropout(0.1))
model.add(
Conv1D(
filters,
kernal_size,
padding = 'valid',
activation ='relu'))
model.add(
MaxPooling1D())
model.add(
Conv1D(
filters,
kernal_size,
padding = 'valid',
activation ='relu'))
model.add(
MaxPooling1D())
# Recurrent layer
model.add(
Bidirectional(
LSTM(
300,return_sequences=True)))
model.add(Flatten())
model.add(Dense(1000, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(250))
#model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(6, activation='softmax'))
#adam = optimizers.Adam(lr=learning_rate, decay=1e-6)
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
lstm_model4 = nlp_lstm_model4()
lstm_model4.summary()
#### Train the model ####
start = time.perf_counter()
batch_size = 4
lstm_history = lstm_model4.fit(X_train, y_train, epochs=100, batch_size=batch_size, validation_data=(X_test, y_test), verbose=2, callbacks=[stop, reduce_lr])
end = time.perf_counter()
print('Time spent:', end-start)
#### Train the model ####
start = time.perf_counter()
batch_size = 8
lstm_history = lstm_model4.fit(X_train, y_train, epochs=100, batch_size=batch_size, validation_data=(X_test, y_test), verbose=2, callbacks=[stop, reduce_lr])
end = time.perf_counter()
print('Time spent:', end-start)
#### Train the model ####
start = time.perf_counter()
batch_size = 16
lstm_history = lstm_model4.fit(X_train, y_train, epochs=100, batch_size=batch_size, validation_data=(X_test, y_test), verbose=2, callbacks=[stop, reduce_lr])
end = time.perf_counter()
print('Time spent:', end-start)
#### calculate the accuracy and print ####
lstm_scores = lstm_model4.evaluate(X_test, y_test, verbose=0)
print("Accuracy of the LSTM model : %.2f%%" % (lstm_scores[1]*100))
def nlp_transformer_model1():
## hyperparameters
learning_rate = 0.00099
Lambda = 0.00029
embed_dim = 32 # Embedding size for each token
num_heads = 2 # Number of attention heads
ff_dim = 32 # Hidden layer size in feed forward network inside transformer
inputs = layers.Input(shape=(maxlen,))
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(200)(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(100)(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(50)(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(6, activation="softmax",kernel_regularizer=regularizers.l2(Lambda))(x)
model = keras.Model(inputs=inputs, outputs=outputs)
adam = optimizers.Adam(learning_rate=learning_rate, decay=1e-6)
# Compile model
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
return model
trns_model = nlp_transformer_model1()
trns_model.summary()
#### Train the model ####
start = time.perf_counter()
batch_size = 4
trns_history = trns_model.fit(X_train, y_train, epochs=100, batch_size=batch_size, validation_data=(X_test, y_test), verbose=2, callbacks=[stop, reduce_lr])
end = time.perf_counter()
print('Time spent:', end-start)
#### Train the model ####
start = time.perf_counter()
batch_size = 8
trns_model.fit(X_train, y_train, epochs=100, batch_size=batch_size, validation_data=(X_test, y_test), verbose=2, callbacks=[stop, reduce_lr])
end = time.perf_counter()
print('Time spent:', end-start)
#### Train the model ####
start = time.perf_counter()
batch_size = 16
trns_model.fit(X_train, y_train, epochs=100, batch_size=batch_size, validation_data=(X_test, y_test), verbose=2, callbacks=[stop, reduce_lr])
end = time.perf_counter()
print('Time spent:', end-start)
#### calculate the accuracy and print ####
scores = trns_model.evaluate(X_test, y_test, verbose=0)
print("Accuracy of the model : %.2f%%" % (scores[1]*100))
#### predict the labels for test data ####
predicted_labels = trns_model.predict(X_test)
print(predicted_labels[5])
predicted_class = predicted_labels
predicted_labels.shape, predicted_class.shape
predicted_labels[10]
y_test[10]
y_test.shape
import keras
from keras.models import Sequential
import tensorflow as tf
from keras.layers import Dense, Activation, Dropout,Input
from keras.layers import Conv1D
from tensorflow.keras import layers
print(safety_data_nlp.info()) #X_train, X_test, y_train, y_test
print(safety_data_ml.info())
def nlp_rnn_model():
learning_rate = 0.00099
model = Sequential()
# Embedding layer
model.add(
Embedding(
input_dim=num_words,
output_dim=300,
weights=[embedding_matrix],
input_length=maxlen,
trainable=False))
# Recurrent layer
model.add(
keras.layers.SimpleRNN(
300,return_sequences=True))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(100))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(50))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(6, activation='softmax'))
adam = optimizers.Adam(learning_rate=learning_rate, decay=1e-6)
# Compile model
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
return model
rnn_model = nlp_rnn_model()
rnn_model.summary()
batch_size = 32
rnn_history = rnn_model.fit(X_train, y_train, epochs = 100, batch_size=batch_size, validation_data=(X_test, y_test), verbose = 2, callbacks = [stop, reduce_lr])
rnn_scores = rnn_model.evaluate(X_test, y_test, verbose=0)
print("Accuracy",rnn_scores[1])
Let us now proceed with the following classical ML models and validate which one performs best.
Import statements for the ML models
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, metrics, svm
from sklearn.utils import shuffle
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
safety_data_tfidf_ml = safety_data_ml.copy()
safety_data_tfidf_ml.info()
import string

def clean_text(text):
# remove_URL
url = re.compile(r'https?://\S+|www\.\S+')
text = url.sub(r'', text)
# remove_html
html = re.compile(r'<.*?>')
text = html.sub(r'', text)
# remove_emoji
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
u"\U00002702-\U000027B0"
u"\U000024C2-\U0001F251"
"]+", flags = re.UNICODE)
text = emoji_pattern.sub(r'', text)
# remove_punct
table = str.maketrans('', '', string.punctuation)
text = text.translate(table)
return text
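A quick usage sketch of the cleaning steps above (emoji removal omitted for brevity), applied to a made-up description:

```python
import re
import string

def clean_text_demo(text):
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # strip URLs
    text = re.sub(r'<.*?>', '', text)                   # strip HTML tags
    return text.translate(str.maketrans('', '', string.punctuation))  # strip punctuation

print(clean_text_demo("Worker slipped <b>near</b> press! See www.example.com/report."))
```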
safety_data_tfidf_ml['text'] = safety_data_tfidf_ml['Description'].apply(lambda x : clean_text(x))
column_preprocessor = ColumnTransformer(
[
('text_tfidf', TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}'), 'text'),
],
remainder='drop',
n_jobs=1
)
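`TfidfVectorizer` down-weights terms that occur in many descriptions via the idf factor; a hand computation of scikit-learn's default smoothed idf on toy documents:

```python
import math

# Toy tokenized "descriptions"
docs = [["press", "hand", "injury"], ["fall", "injury"], ["press", "burn"]]
N = len(docs)

def idf(term):
    df = sum(term in d for d in docs)          # document frequency
    return math.log((1 + N) / (1 + df)) + 1    # smoothed idf, scikit-learn's default

print(round(idf("injury"), 3))  # appears in 2 of 3 docs, so it gets a low weight
```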
#safety_data_temp = pd.DataFrame(column_preprocessor.fit_transform(safety_data_tfidf_ml))
#type(safety_data_temp)
#safety_data_temp.info()
Split features and labels from the dataframe
safety_data_tfidf_ml.info()
X_ml = safety_data_tfidf_ml.drop(['Potential Accident Level','Accident Level'], axis=1)
#y_ml = safety_data_tfidf_ml['Potential Accident Level']
y_ml = safety_data_tfidf_ml['Accident Level']
SEED = 40
# Train-Test split
X_train_ml, X_test_ml, y_train_ml, y_test_ml = model_selection.train_test_split(X_ml, y_ml, test_size = 0.15, random_state=SEED)
#unique, counts = np.unique(y_train_ml1, return_counts=True)
#dict(zip(unique, counts))
#from imblearn.over_sampling import SMOTE
#oversample = SMOTE()
#X_train_ml, y_train_ml = oversample.fit_resample(X_train_ml1, y_train_ml1)
#unique, counts = np.unique(y_train_ml, return_counts=True)
#dict(zip(unique, counts))
pipeline = Pipeline([
('column_preprocessor', column_preprocessor),
('svm', svm.SVC(kernel='rbf', C=10, gamma=1.211))
])
# Training
pipeline.fit(X_train_ml, y_train_ml)
predictions_ml_svm = pipeline.predict(X_test_ml)
print(metrics.accuracy_score(y_test_ml, predictions_ml_svm))
pipeline3c = Pipeline([
('column_preprocessor', column_preprocessor),
('LR', LogisticRegression(n_jobs=1, C=1e5,class_weight='balanced',multi_class='multinomial'))
])
# Training
pipeline3c.fit(X_train_ml, y_train_ml)
predictions3c_LR = pipeline3c.predict(X_test_ml)
print(metrics.accuracy_score(y_test_ml, predictions3c_LR))
pipeline_RF = Pipeline([
('column_preprocessor', column_preprocessor),
('RFC', RandomForestClassifier(max_depth=150,max_leaf_nodes=2, random_state=0))
])
# Training
pipeline_RF.fit(X_train_ml, y_train_ml)
predictions_RF = pipeline_RF.predict(X_test_ml)
print(metrics.accuracy_score(y_test_ml, predictions_RF))
pipeline_onerest = Pipeline([
('column_preprocessor', column_preprocessor),
('svm', OneVsRestClassifier(LinearSVC(loss='hinge',random_state=42,class_weight='balanced')))
])
# Training
pipeline_onerest.fit(X_train_ml, y_train_ml)
predictions_onerest = pipeline_onerest.predict(X_test_ml)
print(metrics.accuracy_score(y_test_ml, predictions_onerest))
def nlp_nn_model():
num_labels = 6
model = Sequential()
model.add(Dense(512, input_shape=(maxlen,)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.3))
model.add(Dense(512))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.3))
model.add(Dense(num_labels))
model.add(BatchNormalization())
model.add(Activation('softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
return model
model_nn = nlp_nn_model()
#### Train the model ####
start = time.clock()
batch_size = 16
model_nn.fit(X_train, y_train, epochs = 100, batch_size=batch_size, validation_data=(X_test, y_test), verbose = 2, callbacks = [stop, reduce_lr])
end = time.clock()
print('Time spent:', end-start)
#### calculate the accuracy and print ####
scores = model_nn.evaluate(X_test, y_test, verbose=0)
print("Accuracy of the model : %.2f%%" % (scores[1]*100))
prediction_nn = model_nn.predict(X_test)
predictions_nn = np.argmax(prediction_nn, axis = 1)
y_test_labels = np.argmax(y_test, axis =1)
y_test[0]
np.argmax(y_test[0])
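Decoding the softmax output back to a class id, as above, is just a row-wise `argmax`:

```python
import numpy as np

probs = np.array([[0.1, 0.7, 0.2],    # highest probability at index 1
                  [0.6, 0.3, 0.1]])   # highest probability at index 0
print(np.argmax(probs, axis=1))  # [1 0]
```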
# Map of model names and models
ml_models = {
'Logistic Regression': pipeline3c,
'SVM Model' : pipeline,
'OneVsRest with SVM' : pipeline_onerest,
'Random Forest' : pipeline_RF
}
nlp_models = {
'NN Model' : model_nn,
'LSTM Model1' : lstm_model1,
'LSTM Model2' : lstm_model2,
'LSTM Model3' : lstm_model3,
'LSTM Model4' : lstm_model4,
'Transformer Model ' : trns_model
}
# function definition
def evaluate_model_performance(y_test, y_predict, _labels=None, _average='weighted'):
score = metrics.accuracy_score(y_test, y_predict)
precision = metrics.precision_score(y_test, y_predict, labels=_labels, average=_average)
recall = metrics.recall_score(y_test, y_predict, labels=_labels, average=_average)
f_score = metrics.f1_score(y_test, y_predict, labels=_labels, average=_average)
print(f'Accuracy score = {score}, Precision score = {precision}, Recall score = {recall}, F-measure score {f_score}')
return score, precision, recall, f_score
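The `average='weighted'` option averages per-class scores weighted by each class's support; a hand computation of weighted recall on a tiny example:

```python
import numpy as np

y_true = np.array([0, 0, 0, 1])
y_pred = np.array([0, 0, 1, 1])
recalls, weights = [], []
for c in np.unique(y_true):
    mask = y_true == c
    recalls.append(np.mean(y_pred[mask] == c))   # per-class recall
    weights.append(mask.sum() / len(y_true))     # class support share
print(np.dot(recalls, weights))
```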
def evaluate_all(models_map, X_test, y_test, model_type):
model_list = []
model_performances = []
if model_type == 'NLP':
y_test = np.argmax(y_test, axis =1)
for model_name in models_map:
model_list.append(model_name)
y_pred = (models_map[model_name]).predict(X_test)
if model_type == 'NLP':
y_pred = np.argmax(y_pred, axis =1)
model_scores = evaluate_model_performance(y_test, y_pred)
model_performances.append(model_scores)
summary = pd.DataFrame(model_performances,
model_list,
['Accuracy', 'Precision', 'Recall', 'F-score'])
return summary
ml_result = evaluate_all(ml_models, X_test_ml, y_test_ml, 'ML')
nlp_result = evaluate_all(nlp_models, X_test, y_test, 'NLP')
combined_result = pd.concat([ml_result, nlp_result])
#combined_result
combined_result.sort_values(by=['Precision', 'Accuracy'], ascending=False, inplace=True )
combined_result
def predict_lstm(x, y_test, model):
x['text'] = x['Description'].apply(lambda t: clean_text(t))
text = x['text']
# Reuse the tokenizer already fitted on the training data;
# refitting it on the sample would assign different word indices
encoding = tokenizer.texts_to_sequences(text)
text = pad_sequences(encoding, maxlen=maxlen, padding='post')
predictions = model.predict(text)
predictions = np.argmax(predictions, axis=1)
print("\n LSTM Prediction ", predictions)
print("\n Actual safety Accident Level value", safety_data["Accident Level"][4])
sample_df = pd.DataFrame([safety_data["Description"][4]],columns=['Description'])
predict_lstm(sample_df,y_test,lstm_model4)
lstm_model2.save(project_path + '/chatbot_model_al.h5')
Summary: